Calibration of an Item Pool for the
Adaptive Measurement of Achievement

The majority of research in adaptive testing to date has been concerned
with ability testing (Weiss, 1973, 1976). Very little adaptive testing
research has addressed itself to the unique problems of achievement
measurement (Weiss, 1973, pp. 40-41). Although frequently treated as if they
are highly similar in approach (e.g., English, Reckase, & Patience, 1977),
the adaptive measurement of ability and achievement can present quite
different problems. These differences arise, in part, from the different
kinds of item pools which are available for the measurement of ability vs.
achievement.

In the measurement of ability, the test constructor defines the nature
of the item pool. Once the ability domain is specified, large numbers of test
items can be generated, and the item pool can be defined to have whatever
characteristics are deemed by the test constructor to be psychometrically
desirable. Thus, ability tests can be designed to be unidimensional by
eliminating from the item pool those items which measure extraneous dimensions.
Similarly, if an item pool is being developed for adaptive testing, the
ability test constructor can construct a unidimensional pool which consists
of items with a wide range of difficulties and high discriminations (e.g.,
McBride & Weiss, 1974). Given the availability of such a pool, there is
little question of the applicability of such unidimensional models as those
from latent trait theory (e.g., Lord & Novick, 1968) or the strategies of
adaptive testing which have been designed to measure individual differences
within a unidimensional framework (Weiss, 1974).

In most practical achievement testing settings, however, test constructors
do not have the freedom to construct the kinds of ideal item pools that
are possible in ability measurement. In the achievement testing environment,
where the purpose is to measure what students have learned as a result of
some instructional exposure, the nature and extent of an item pool is largely
dictated by the content covered in the course. Thus, a course might convey
information on a variety of topics which are part of the larger content
area defining the course but are not so highly correlated with each other
that they can be considered to be one dimension. Similarly, because these
separable content areas may be limited in scope, it may not be possible for
the test constructor to generate large numbers of test items in each content
area or to generate a pool of items large enough to meet the requirements
of some adaptive testing strategies.

Since adaptive testing in the ability domain has been shown to have
considerable promise (Lord, 1977; Urry, 1977; Weiss, 1976), it is appropriate
to determine whether it will be similarly useful in applications to the unique
problems of achievement measurement. However, because of the differences
in the characteristics of the item pools, it is necessary first to examine
typical pools of achievement test items; in this way it can be determined
whether they can meet the criteria necessary for the implementation of
currently available adaptive testing models or whether new models will be
required to implement the adaptive measurement of achievement. This report
is addressed to that question.

Alternative Psychometric Bases for Adaptive Testing

There are three general psychometric models on which the adaptive
measurement of achievement can be based: classical test theory (Gulliksen,
1950), order theory (Cliff, 1975, 1976), and item characteristic curve (ICC)
theory (Lord, 1974).

Classical test theory. In general, classical test theory cannot provide
an adequate psychometric framework for an adaptive achievement testing
system. The objective of an adaptive testing system is to individualize the
test for each testee by selecting test items on the basis of the testee's
responses to previously administered items. As a result, different testees
respond to different items. Since classical test theory uses as its scoring
system the total number of correct answers to test items, testees of different
levels of achievement will be indistinguishable from one another if their
adaptive tests are scored in this way.

The only method that classical test theory has at its command for
dealing with an incomplete response matrix is multiple-matrix sampling (Lord
& Novick, 1968). However, although this technique is designed to estimate
the mean achievement level of persons in a group, it cannot efficiently
estimate an individual's achievement score (Lord, 1977). Furthermore, matrix
sampling assumes that each individual in the sample takes a group of items
selected at random from the pool. This assumption runs counter to the
philosophy of adaptive testing, in which the objective is to select items for
each testee in a deliberately non-random manner.

Order theory. One method to circumvent the problems caused by different
persons completing different test items is called order theory (Cliff, 1975,
1976). This theory is based on the formation of a triangular matrix which
orders individuals using their responses to some subset of items from an item
pool. One assumption of order theory is that all items are Guttman items,
i.e., items which are perfectly discriminating. However, although this
assumption will yield greatly reduced test lengths, it is doubtful that
Guttman items will appear in typical achievement testing situations. By basing
its procedures on Guttman items, order theory also makes very strong
assumptions about unidimensionality, considerably stronger than those made by
either classical test theory or ICC theory. Order theory as a general system
for the measurement of individual differences is quite new, and many of its
basic problems and procedures have yet to be adequately articulated. Perhaps
at a later date it will become a useful system for the adaptive measurement
of achievement.

Item characteristic curve theory. Item characteristic curve (ICC) theory
or item response theory, which has been used to provide a psychometric basis
for the adaptive measurement of ability (e.g., Lord, 1976; McBride & Weiss,
1974; Urry, 1976; Vale & Weiss, 1975a, b), may also provide an appropriate
model for the adaptive measurement of achievement.



Two properties of ICC theory are especially relevant in this context.
First, ICC theory provides a means for obtaining scores on the same metric
for persons who have completed different test items. As indicated earlier,
this is an essential requirement for adaptive tests. Second, under the
assumptions of ICC theory, the resulting score metric is invariant with
respect to population. Thus, if a set of data from a given group of testees
can be shown to meet the assumptions of ICC theory, it is possible to score
all individuals on the same equal interval scale regardless of the subgroup of
the population to which they belong.

With these two advantageous properties, ICC theory provides the promise
of measurement which is not dependent upon either the set of test items a
person has answered or his/her population subgroup membership. There is, in
addition, a third advantage of ICC theory: it provides a flexible psychometric
framework for the development of criterion-referenced achievement tests. As
Hambleton & Cook (1977) note, there is likely to be a great degree of
homogeneity among items covering a single criterion-referenced instructional
objective. As a result of this homogeneity, the basic assumption of
unidimensionality required by ICC models is very likely to be satisfied.

Because of the degree of articulation of ICC theory and the development
of means for its implementation, it appears to be a viable approach to the
adaptive measurement of achievement. Furthermore, it is possible to test
the fit of a set of data to the theory prior to its use for the development
of an adaptive testing system.

Objective

Within the context of a practical achievement testing problem, this
report is concerned with the applicability of ICC theory to the measurement
of achievement. Specifically, its purpose is to 1) evaluate the fit of the
item characteristic curve model to items on a multiple-choice achievement
test; 2) investigate the dimensionality of an achievement test item pool with
respect to the unidimensionality assumption of latent trait theory; and 3)
determine whether the item parameters of ICC theory, within the context of an
achievement test, are invariant across different subgroups from a population.

The Achievement Measurement Context

The Course and Examination Procedures

This study used data from Biology 1-011, an introductory biology course
open to all students at the University of Minnesota. Both majors and
non-majors in the natural sciences enroll in this course. Biology 1-011 is
offered every quarter. Quarterly enrollment ranges from 1000 to 1500 students,
with the fall quarter tending to have the highest number of students.
Students are generally freshmen, but a substantial number of sophomores and
a few juniors and seniors enroll in the course. The sexes are about equally
represented. According to the course staff, there seem to be no important
changes in the demographic composition of the student body from quarter to
quarter. Instruction in the course is by means of videotaped lectures which
are shown on closed circuit television. The lectures do not change from
quarter to quarter but are revised every two years. In addition to the
lectures there is a compulsory laboratory.

Students are given two midquarter examinations and a final examination
each quarter. All examinations use multiple-choice items. The first
midquarter examination includes 55 questions, and each student is required to
answer only 50 of them. It covers the areas of 1) chemistry, 2) the cell,
and 3) energy. The second midquarter examination also includes 55 questions,
of which 50 must be answered. It covers two additional content areas:
4) genetics and 5) reproduction and embryology. The final examination
includes 110 items, of which only 100 must be answered. It covers the five
previous content areas plus two additional ones: 6) ecology and 7) evolution.



                         Table 1
           Content Areas and Item Number Ranges

Content Area
   Number       Content                         Item Numbers
     1          Chemistry                        3000-3200
     2          The Cell                         3201-3400
     3          Energy                           3401-3600
     4          Heredity/Genetics                3601-3800
     5          Reproduction and Embryology      3801-4000
     6          Ecology                          4001-4200
     7          Evolution                        4201-4400

The Item Pool 

The basic item pool for this study consisted of item responses on the
two midquarter examinations and the final examination for winter and spring
quarters of 1976. Items were classified by content areas; items in each
content area were assigned numbers within the range shown in Table 1.











                              Table 2
      Number of Items in the Item Pool by Test and Content Area

                          Content Area
Test        1     2     3     4     5     6     7    Total
W1         21    22    12                              55
S1         19    25    11                              55
W2                           36    19                  55
S2          2                35    18                  55
WF          9    14     7    18     9    28    25     110
SF          9    12     6    17    11    30    25     110
Total      60    73    36   106    57    58    50     440
Unique     53    60    33   101    48    52    47     394

Table 2 shows the number of items in the item pool by source and content
area. In the first column of Table 2, the letters S and W refer to spring
and winter quarters, while 1, 2, and F refer to the test from which the items
were taken: the first midquarter, the second midquarter, and the final
examination, respectively. Since some of the items were repeated between
the two quarters, Table 2 also shows the number of unique items in each
content area. The repeated items were used to test the invariance assumption
of ICC theory across population subsamples.

Table 3 shows the number of unique items obtained from each of the exams
and the average number of testees who answered each of these items in the
tests used for calibration of the item pool.



                    Table 3
       Number of Unique Items and Average
        Number of Testees for Each Test

            Number of       Average Number
Test       Unique Items       of Testees
W1              48               998
S1              46               838
W2              52               934
S2              48               760
WF              99               888
SF             101               638

The initial goal of these analyses was to form two item pools for later
adaptive testing research. Each of these pools was to be designed for use
with one of the midquarter examinations. The dimensionality analyses reported
below are thus confined to these midquarter item pools. The applicability
analyses and the invariance analyses, however, utilized items from the final
examinations.



Applicability of the ICC Model

An initial question to be answered in the use of ICC theory in a
multi-content achievement test is whether application of the procedures of
the unidimensional ICC model to such test items would yield estimates of item
parameters which would be useful for adaptive testing. Since adaptive
tests function best when items span a wide range of difficulties and have
relatively high discriminating power (Urry, 1976; Vale & Weiss, 1975b),
it is possible that typical achievement test items might not meet even
these minimal requirements. For example, it is possible that because of the
varying content in the item pool, item discriminations would be so low as
to indicate a great deal of heterogeneity in the test items. Therefore, the
first set of analyses of the item pool involved the determination of item
parameter estimates for each item in the pool and the examination of the
resulting estimates with regard to their utility for the construction of
adaptive tests.

The ICC Model

Because the items were multiple-choice, a three-parameter ICC model
for dichotomous item responses was appropriate. This model has been described
in detail by Hambleton & Cook (1977), Lord & Novick (1968, Ch. 1), and McBride
& Weiss (1974). The model assumes that the item characteristic curve for an
item can be completely described by three parameters: a, the discriminating
power of the item, which is proportional to the maximum slope of the ICC at
its point of inflection; b, the item difficulty, which specifies the location
on the underlying trait continuum of the point of inflection of the ICC; and
c, the "guessing" parameter, which is the probability of a correct response
to the item for a testee of infinitely low trait level and is sometimes
described as the probability of a correct response by random guessing.
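
The roles of the three parameters can be illustrated with a minimal sketch of a three-parameter logistic curve. The function name icc_3pl and the scaling constant D = 1.7 (commonly used so that the logistic curve approximates the normal ogive) are assumptions made for illustration; they are not taken from the report.

```python
import math

D = 1.7  # assumed scaling constant; not specified in the report

def icc_3pl(theta, a, b, c):
    """Probability of a correct response under a three-parameter logistic ICC.

    theta: trait level, a: discrimination, b: difficulty, c: guessing parameter.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# At theta == b (the point of inflection) the curve is halfway between c and 1.
p_mid = icc_3pl(0.5, a=1.2, b=0.5, c=0.25)

# Far below b the probability approaches the guessing parameter c.
p_low = icc_3pl(-10.0, a=1.2, b=0.5, c=0.25)
```

With c = .25, the value at the point of inflection is (1 + .25)/2 = .625, which shows why b is not the 50% point once guessing is allowed.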

Estimation of Item Parameters

Procedure. The process of estimating item parameters in ICC test
theory is essentially a curve-fitting procedure. An item characteristic
curve is fit for each item based on the item responses of a group of testees.
Because "best fit" may be defined in several ways, there are different
estimation procedures (see Hambleton & Cook, 1977, p. 89). The procedure
used here was based on a logistic ICC model using a minimum chi-square
definition of fit, as operationalized in Urry's ESTEM program (see Urry,
1976, p. 99).

As defined by Urry, the best-fitting curve is the one that minimizes
the criterion

\[
\chi_g^2 = \sum_{j=0}^{m} \frac{\left[ r_{gj} - n_j P_g(j) \right]^2}{n_j P_g(j) Q_g(j)} \tag{1}
\]

where $r_{gj}$ = the number of testees at score $j$ who correctly answer item $g$,
      $n_j$ = the number of testees who obtain a score of $j$,
      $P_g(j)$ is the expected proportion of correct responses to item $g$
              among those with a score of $j$,
      $Q_g(j) = 1 - P_g(j)$, and
      $m$ is the number of items in the test.
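
The definitions above describe a Pearson-type chi-square summed over score groups. The following is a minimal sketch under that assumption: the denominator n_j P(j) Q(j) and the function name chi_square_fit are illustrative choices and are not taken from the ESTEM program itself.

```python
def chi_square_fit(r, n, p):
    """Pearson-type chi-square for one item (assumed form of the criterion).

    r[j]: number of testees at score j who answered the item correctly
    n[j]: number of testees who obtained score j
    p[j]: expected proportion correct at score j under the trial ICC
    """
    total = 0.0
    for r_j, n_j, p_j in zip(r, n, p):
        if n_j == 0 or p_j <= 0.0 or p_j >= 1.0:
            continue  # skip empty or degenerate score groups
        q_j = 1.0 - p_j
        total += (r_j - n_j * p_j) ** 2 / (n_j * p_j * q_j)
    return total

n = [10, 20, 10]
p = [0.2, 0.5, 0.8]
perfect = chi_square_fit([2, 10, 8], n, p)   # observed counts equal expected counts
off = chi_square_fit([4, 10, 8], n, p)       # first score group deviates by 2
```

In the second call only the first score group contributes: (4 - 2)^2 / (10 x .2 x .8) = 2.5.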

Urry's computing algorithm consists of two stages. During the first stage,
for a given item the procedure increments the value of c (the guessing
parameter) from .02 to .30. At each increment, values of a and b consistent
with c are found. That is, several trial ICCs are generated. Then, for each
of these trial ICCs, Equation 1 is computed. The parameters corresponding
to the trial ICC that yields the minimum value of the criterion are taken as
initial estimates. These estimates are refined by a method known as ancillary
estimation, which was developed by Fisher (1950). They are refined further
at the second stage, which is identical to the first, except that a Bayes
modal estimate of trait level (Samejima, 1969) is used as the metric,
rather than the standardized raw scores used in the first stage.
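
The first stage can be pictured as a search over trial curves. The sketch below is not Urry's algorithm: it replaces the step that finds a and b "consistent with" each value of c by a plain grid search, and the grid spacings, the constant D = 1.7, the chi-square form, and all function names are assumptions made for illustration.

```python
import math

D = 1.7  # assumed logistic scaling constant

def icc(theta, a, b, c):
    """Three-parameter logistic ICC."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def chi_square(r, n, thetas, a, b, c):
    """Assumed Pearson-type fit criterion for one trial ICC."""
    total = 0.0
    for r_j, n_j, t_j in zip(r, n, thetas):
        p = icc(t_j, a, b, c)
        total += (r_j - n_j * p) ** 2 / (n_j * p * (1.0 - p))
    return total

def first_stage(r, n, thetas):
    """Return the (a, b, c) triple on the grid with the smallest criterion."""
    c_grid = [k / 100 for k in range(2, 31, 2)]  # .02 to .30; step size is assumed
    a_grid = [k / 10 for k in range(4, 21, 2)]   # hypothetical grid for a
    b_grid = [k / 2 for k in range(-4, 5)]       # hypothetical grid for b
    best, best_fit = None, float("inf")
    for c in c_grid:
        for a in a_grid:
            for b in b_grid:
                fit = chi_square(r, n, thetas, a, b, c)
                if fit < best_fit:
                    best, best_fit = (a, b, c), fit
    return best

# Noise-free synthetic data generated from a = 1.0, b = 0.0, c = .20.
thetas = [k / 2 for k in range(-4, 5)]   # score groups on a standardized metric
n = [100] * len(thetas)
r = [n_j * icc(t, 1.0, 0.0, 0.20) for n_j, t in zip(n, thetas)]
initial = first_stage(r, n, thetas)
```

Because the synthetic counts contain no sampling error and the generating values lie on the grid, the search returns them exactly; with real data the minimum would only approximate the parameters, which is why the estimates are refined afterwards.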



Evaluation of the estimation procedure. The accuracy and efficiency
of the ESTEM program have been tested in computer simulations with synthetic
data (Gugel, Schmidt, & Urry, 1976; Urry, 1976), using sample sizes ranging
from 300 to 3000 and test lengths ranging from 50 to 100 items. In these
studies two criteria have been used in evaluating the estimates yielded by
the program. The first evaluative criterion was the root mean square error
(RMSE), which was defined as

\[
\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{g=1}^{n} \left( \hat{\alpha}_g - \alpha_g \right)^2 \right]^{1/2}
\]

where $\hat{\alpha}_g$ is an estimated parameter value for the $g$th item,
      $\alpha_g$ is the known parameter value from which the synthetic data
              were generated, and
      $n$ is the number of items.

Their second evaluative criterion was simply the Pearson product-moment
correlation between the estimated parameter value and the known parameter
value.

Root mean square error is a measure of the discrepancy between the value
of the parameter estimate and the numerical value of the generating parameter.
It includes both sampling fluctuations and bias. Its usefulness is limited
to comparing estimates of the same parameter across different situations,
since it is scale dependent. The correlation coefficient, on the other hand,
is scale free and can be used in intra- as well as inter-parameter comparisons.
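
The difference between the two criteria can be seen in a small sketch. The function names and the toy parameter values are assumptions chosen for illustration; the estimates are deliberately biased upward by a constant .5.

```python
import math

def rmse(estimates, generating):
    """Root mean square error between estimated and generating parameter values."""
    n = len(estimates)
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(estimates, generating)) / n)

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

generating = [0.8, 1.0, 1.2, 1.4]
biased = [g + 0.5 for g in generating]   # constant bias, no sampling error

error = rmse(biased, generating)
corr = pearson_r(biased, generating)
```

The constant bias shows up fully in the RMSE (.5) but leaves the correlation at 1.0, which is why the two criteria are reported side by side.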

The simulation studies by Gugel, Schmidt, & Urry (1976) provide some
data with which to evaluate the applicability of ESTEM's item parameter
estimation procedures for the data base available in the present study
(i.e., testee groups of between 600 and 1,000 persons and test lengths of
50 or 100 items). Table 4 shows results from the simulation studies of a
50-item test for 500 and 1,000 simulated testees.

                             Table 4
    RMSE and Correlation of Estimate and Parameter Values for the
      a, b, and c Parameters for 50 Items and Two Sample Sizes
               [From Gugel, Schmidt and Urry (1976)]

                   RMSE                   Correlation
    N         a      b      c          a      b      c
   500      .472   .259   .077       .780   .989   .454
  1000      .326   .209   .078       .908   .990   .492

As Table 4 shows, for a 50-item test (similar to the midquarter
examinations used in this study) more accurate estimates of the parameters
were generally obtained with the larger group of simulated testees. For
example, the RMSE values for the final estimates of the a parameter were
.472 for N=500 and .326 for N=1,000. The corresponding correlations were
.780 and .908. The improved accuracy of estimation as N increased occurred
for the b and c parameters as well. It should be noted, however, that for
50-item tests at the two sample sizes the b parameter is very accurately
estimated regardless of sample size, the a parameter is fairly well estimated,
and the c parameter is poorly estimated (r=.454 and .492).

Table 5 shows the results of the Gugel et al. simulation study
corresponding to the maximum sample size used in the present study (N=1,000).
The test lengths in Table 5 vary from 50 to 100 to reflect the lengths of
the midquarter and final examinations used here. As Table 5 shows, for a
fixed number of persons, increases in the number of items do not generally
result in more accurate parameter estimates. For the b parameter, which
is very accurately estimated with 1,000 cases, the accuracy improves from
r=.990 to .996. The c parameter, which is poorly estimated at N=1,000, shows
increases from r=.492 to .627. For the a parameter there is no clear trend
in the correlations, with the highest accuracy at 50 items (r=.908) and the
lowest at 60 items (r=.842). The results for the three parameters, using the
RMSE criterion, show no clear trends either.



                             Table 5
       RMSE and Correlation of Estimate and Parameter Values for
  Parameters a, b and c for a Sample Size of 1000 at Four Test Lengths
               [From Gugel, Schmidt and Urry (1976)]

Number of            RMSE                   Correlation
  Items         a      b      c          a      b      c
    50        .326   .209   .078       .908   .990   .492
    60        .322   .144   .062       .842   .995   .558
    80        .261   .166   .073       .879   .993   .550
   100        .240   .162   .062       .863   .996   .627



The results from Table 4, together with those from Table 5, show that with
the numbers of testees and numbers of items used in this study, the b
parameter (item difficulty) is very accurately estimated, while the a
(discrimination) and c (guessing) parameters are less well estimated by this
procedure.

Criteria for excluding items. Urry's item calibration program does not
report ICC item parameters for an item if the calculated parameters meet
any of the following criteria:

     1. a less than .80
     2. b less than -4.00 or greater than 4.00
     3. c greater than .30

These rejection criteria are applied to the items only in the first phase
of the calibration procedure. The final parameters of the items that are not
excluded in the first phase are allowed to vary unrestrained in the second
phase of calibration. Those items that were rejected in the first phase of
the program were excluded from further analyses.
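
The three rejection criteria amount to a simple screen on first-phase estimates. A minimal sketch follows; the function name is_excluded and the example values are hypothetical, and the thresholds are those listed above.

```python
def is_excluded(a, b, c):
    """True if first-phase estimates meet any of the three rejection criteria."""
    return a < 0.80 or b < -4.00 or b > 4.00 or c > 0.30

# Hypothetical first-phase estimates for four items: (a, b, c)
first_phase = [
    (1.10, 0.25, 0.22),   # retained
    (0.65, 0.10, 0.20),   # rejected: a below .80
    (1.30, 4.50, 0.18),   # rejected: b outside -4.00 to 4.00
    (1.05, -0.40, 0.35),  # rejected: c above .30
]
retained = [item for item in first_phase if not is_excluded(*item)]
```

Because the screen is applied only in the first phase, final estimates for retained items can still fall outside these limits after the second phase.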

Results

Excluded items. Table 6 shows the number and percentage of items in
each content area which did not meet the criteria specified by Urry's
calibration program. Of the 394 unique (i.e., non-repeated) items in the
pool, 85 (or 22%) met one or more of Urry's exclusionary criteria. The
percentage of items lost by content area varied from 9% for content area 3
(energy) to 33% for content area 6 (ecology). Almost without exception, the
items which were excluded by the calibration program had very low
point-biserial correlations with total score. This indicates that most of the
rejected items were excluded because of low estimates of the a parameter
for these items.



                              Table 6
           Number of Items Lost in the Calibration Process
                      by Test and Content Area

                            Content Area
Test            1     2     3     4     5     6     7    Total
W1              8     5     2                              15
S1              4     4     1                               9
W2                                5     6                  11
S2              1                 4     3                   8
WF              1     2           2     1     4     4      14
SF              2     2           2     3    13     6      28
Total          16    13     3    13    13    17    10      85
Percent of
Unique Items   30    22     9    13    27    33    21      22
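
The percentages in the last row of Table 6 follow from the "Total" row of Table 6 and the "Unique" row of Table 2. The short check below reproduces them; the variable names are illustrative only.

```python
# Unique items per content area (Table 2) and items lost in calibration (Table 6)
unique = [53, 60, 33, 101, 48, 52, 47]
lost = [16, 13, 3, 13, 13, 17, 10]

# Percentage of unique items lost in each content area, rounded to whole percents
percent_lost = [round(100 * l / u) for l, u in zip(lost, unique)]

# Overall loss and size of the surviving pool
overall_percent = round(100 * sum(lost) / sum(unique))
surviving = sum(unique) - sum(lost)
```

The overall figures agree with the text: 85 of 394 items (22%) were lost, leaving 309 calibrated items.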



Item pool characteristics. ICC item parameter estimates for all the
items in the pool which survived the calibration procedure are shown in
Appendix Table A, along with the sources from which they were taken. Table 7
shows the mean, standard deviation (S.D.), and range of values for each ICC
parameter estimated for the items in each content area. The final column in
Table 7 contains the same statistics, computed for the 309 items in the
final pool.

As Table 7 shows, the mean discrimination (a) within content areas
varied from 1.09 to 1.32. The lowest a value was .63 and the highest was
4.68. The difficulties within content areas were generally centered
around zero, with the exception of content area 3, which had items of
relatively high average difficulty (b=.92). The item difficulties within
content areas ranged from about -1.75 to about 2.50, with some differences
among content areas. The c parameters for these four-choice items averaged
between .24 and .34 and ranged from .00 to .65.

                                Table 7
       Mean, Standard Deviation, and Range of Item Parameter Estimates
                  by Content Area for Total Item Pool

                                                                      Total
Parameter                          Content Area                       Item
and Statistic          1      2      3      4      5      6      7    Pool
Number of Items       38     47     29     87     36     35     37     309
a (discrimination)
   Mean             1.20   1.23   1.32   1.17   1.26   1.09   1.16    1.20
   S.D.              .35    .60    .80    .41    .60    .39    .36     .50
   Low               .75    .67    .65    .63    .73    .63    .63     .63
   High             2.40   3.54   4.68   3.66   3.88   2.03   2.22    4.68
b (difficulty)
   Mean             -.24    .06    .92    .17    .15   -.46    .13     .10
   S.D.             1.03   1.26   1.06   1.15   1.18   1.29   1.28    1.22
   Low             -1.76  -1.77  -1.56  -1.80  -1.74  -1.88  -1.69   -1.88
   High             2.48   2.49   3.02   3.21   2.62   2.55   2.70    3.21
c (guessing)
   Mean              .28    .25     --     --    .32    .24    .29     .29
   S.D.              .09    .09    .13    .12    .14    .11    .12     .12
   Low               .14    .00    .00    .?2    .06    .11    .11     .00
   High              .5?    .44    .60    .65    .64    .47    .58     .65

Note. A question mark stands for a digit, and -- for an entry, that is
illegible in the source copy.



Urry (1977) has suggested the following guidelines, developed through a
series of simulation studies (Urry, 1971, 1977), to assure that an adaptive
testing item pool will improve the quality of ability measurement:

     1. The a parameters of the items in the pool should exceed .80.
     2. The b parameters of the items should be widely and evenly distributed
        from -2.00 to +2.00.
     3. The c parameters of the items should be less than .30.
     4. There should be at least 100 items in the pool.
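
These guidelines can be turned into a simple summary of a calibrated pool. The sketch below is illustrative only: the function name check_pool, the keys of the returned dictionary, and the four example items are hypothetical, and the evenness of the b distribution is left to inspection of the reported range.

```python
def check_pool(items):
    """Summarize a pool of (a, b, c) estimates against Urry's four guidelines."""
    n = len(items)
    return {
        "n_items": n,
        "enough_items": n >= 100,                                    # guideline 4
        "share_a_above_080": sum(1 for a, _, _ in items if a > 0.80) / n,  # guideline 1
        "share_c_below_030": sum(1 for _, _, c in items if c < 0.30) / n,  # guideline 3
        "b_min": min(b for _, b, _ in items),                        # guideline 2
        "b_max": max(b for _, b, _ in items),
    }

# Hypothetical four-item pool: (a, b, c)
pool = [(1.2, -1.5, 0.20), (0.7, 0.0, 0.35), (1.5, 1.8, 0.25), (0.9, 0.4, 0.28)]
summary = check_pool(pool)
```

Reporting shares rather than a single pass/fail flag matches the way the pool is evaluated in the text, where most, but not all, items meet each guideline.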

As the data in Table A show, less than 12% of the items fell below .80
for the a parameter. Table 7 shows that the average estimate of the a
parameter was above 1.00 for all content areas and 1.20 across all items in
the pool. Thus, the vast majority of the items in this achievement test pool
meet Urry's minimum criterion of a=.80.

The b parameter estimates in this pool show the wide range suggested in
the guidelines, except for a slight deficiency of easy items. With the
exception of content area 3 and, to some extent, content area 6, the mean
values of b were near zero, and the standard deviations were over 1.0.
For the total pool, mean b was .10, and the range of b's was -1.88 to 3.21.

The c parameter estimates averaged .29, narrowly meeting Urry's
guidelines; the c parameters of 140 items failed to meet the .30 cutoff. This
failure was probably caused in part by the inherent instability of the c
parameter estimates, in part by the use of four-alternative multiple-choice
items (in which a correct response could be achieved by random guessing with
p=.25), and in part by the requirement that a student omit five items from
each test. The total parameterized item pool consisted of 309 items drawn
from an initial pool of 394 unique items.

Midquarter subpools. The total item pool described above was used for
the creation of two smaller pools. One pool (MQ1) included all of the items
from the first three content areas covered in the course; the other pool
(MQ2) included all items from the fourth and fifth content areas covered.
These two subpools were also evaluated using Urry's criteria for adaptive
testing item pools.

                                 Table 8
         Distribution of a and c Parameters for Selected Ranges of
       the b Parameter for Items in Each of Two Midquarter Sub-Pools

                                    a                          c
   Range of b     No. of              Range                      Range
   Low     High   Items   Mean  S.D.   Low   High   Mean  S.D.   Low   High
MQ1
  -1.77   -1.50      8    1.20   .61   .79   2.67    .31    --   .17    .56
  -1.49   -1.00     15    1.?5   .41   .77   2.49    .27   .11   .14    .51
   -.99    -.50     15    1.23   .29   .80   1.81    .24   .08   .16    .41
   -.49     .00     15    1.32   .56   .65   2.31    .25   .08   .12    .39
    .01     .50     20    1.09   .29   .66   1.66    .27   .09   .13    .54
    .51    1.00     14    1.14   .30   .71   1.72    .33   .09   .12    .45
   1.01    1.50      9    1.76  1.18   .89   4.68    .35   .17   .00    .60
   1.51    2.00      6    1.32  1.10   .68   3.84    .25   .14   .00    .38
   2.01    3.02     12    1.28   .70   .67   2.77    .35   .09   .17    .52
  Total
  -1.77    3.02    114    1.24   .59   .65   4.68    .28   .11   .06    .60

MQ2
  -1.80   -1.50      8    1.21   .31   .81   1.58    .33   .15   .2?    .65
  -1.49   -1.00     13    1.17   .26   .79   1.53    .26   .16   .14    .64
   -.99    -.50     22    1.21   .27   .82   1.79    .27   .13   .13    .60
   -.49     .00     20     .91   .27   .63   1.53    .31   .12   .12    .53
    .01     .50     13    1.15   .23   .78   1.57    .33   .11   .12    .56
    .51    1.00     19    1.18   .33   .65   1.90    .31   .08   .19    .47
   1.01    1.50     13    1.04   .31   .68   1.69    .37   .08   .24    .48
   1.51    2.00      6    1.72  1.21   .89   3.88    .31   .16   .06    .53
   2.01    2.50      5    1.71  1.16   .81   ?.36    .37   .11   .24    .52
   2.51    3.21      4    1.66   .54   .95   2.11    .52   .11   .39    .65
  Total
  -1.80    3.21    123    1.19   .47   .63   3.88    .32   .13   .06    .65

Note. A question mark stands for a digit, and -- for an entry, that is
illegible in the source copy.


Table 8 shows the distributions of the three ICC parameters for the two
testing pools. As the "Total" lines in Table 8 show, discrimination
parameters (a) for the two pools varied from .65 to 4.68 for MQ1 (114 items)
and from .63 to 3.88 for MQ2 (123 items), with means of a=1.24 and 1.19,
respectively. In the MQ1 pool 13% of the items had a values less than .80;
in the MQ2 pool only 11% were below this value. The b parameters were centered
around 0.0 for each pool (b=.18 and .16) and ranged from -1.77 to 3.02 for MQ1
and -1.80 to 3.21 for MQ2. Mean c parameters were .28 and .32, respectively.



Table 8 shows that, in accordance with Urry's recommendations, these
pools had difficulties which were generally rectangularly distributed, at
least in the range of b=-1.50 to +1.50. There was a lack of easy items in
both pools (b<-1.50), and the MQ2 pool had relatively fewer difficult items
(b>1.50) than did the MQ1 pool. Table 8 also reveals a tendency for the
higher difficulty items to also have higher discriminations. A positive
correlation between item difficulties and discriminations was also reported
in the context of ability measurement by McBride & Weiss (1974) and Lord
(1975). There was no general tendency in these data for the c parameters
to covary with difficulty level, with the exception that the highest average
values of c tended to occur for the most difficult items.

Similar to the total item pool, however, these subpools generally
met Urry's recommendations for adaptive testing item pools. Each pool
included more than 100 items, most items had discrimination values greater
than .80, item difficulties were reasonably rectangularly distributed and
wide-ranging, and typical c values were not unreasonably high.
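The screening described above can be sketched as a small summary routine. This is only an illustration: the thresholds (.80 for a, plus or minus 1.50 for b) are the ones quoted in the text, while the function name `summarize_pool` and the four example items are hypothetical and are not data from the study.

```python
def summarize_pool(items, a_floor=0.80, b_lo=-1.50, b_hi=1.50):
    """Summarize a calibrated pool given as a list of (a, b, c) tuples."""
    n = len(items)
    low_a = sum(1 for a, _, _ in items if a < a_floor)
    return {
        "n_items": n,
        "pct_a_below_floor": 100.0 * low_a / n,           # share of weak discriminators
        "n_easy": sum(1 for _, b, _ in items if b < b_lo),  # items easier than b_lo
        "n_hard": sum(1 for _, b, _ in items if b > b_hi),  # items harder than b_hi
        "mean_c": sum(c for _, _, c in items) / n,
    }

# Hypothetical four-item pool, for illustration only.
pool = [(1.2, -0.4, 0.25), (0.7, 0.1, 0.30), (1.5, 1.8, 0.35), (0.9, -1.7, 0.20)]
summary = summarize_pool(pool)
```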

Conclusions

It is apparent from these data that a three-parameter ICC model is
applicable to college classroom achievement test items. Almost 80%
of the items in the initial pool obtained parameter estimates in usable
ranges. The resulting calibrated pool of items, as well as two subpools,
met general recommendations for the construction of adaptive testing item
pools in the ability testing domain. The subpools deviated somewhat from
these criteria in terms of a lack of very easy and very difficult items,
as well as in c parameters which were slightly higher than desirable.
Whether these high c parameters are a result of unstable estimates, unique
characteristics of the achievement testing pool, or the testing instructions
is unknown. Further research in other achievement testing contexts will
be necessary to answer this question.

Dimensionality of the Item Pool

Traditionally, the hypothesis that a single factor accounts for
performance on a set of test items has been investigated by examining the
dimensionality of the matrix of inter-item tetrachoric correlations by
factor analytic methods (e.g., Indow & Samejima, 1966; McBride & Weiss, 1974;
Prestwood & Weiss, 1977). However, factor analyses of such matrices will,
on occasion, result in more than one factor when only one dimension is present
in the data.

Bock and Lieberman (1970), for example, fitted a two-parameter normal
ogive model to a unidimensional set of five test items. The fit of the model
(and, therefore, unidimensionality) was tested by comparing the observed and
predicted response frequency of every possible response vector. By this
test the unidimensional model was found to fit very well. However, factor
analysis of the inter-item tetrachoric correlation matrix rejected the
hypothesis of a single factor.

Apparently, in the Bock and Lieberman data unidimensionality was not
evident in the factor analysis because of problems introduced by computation
of the tetrachoric correlation coefficient. Thus, in computing such a
matrix, irregularities may be introduced which prevent unidimensionality from
emerging, even if it is present in the data. In the present study,
therefore, the factor analysis was supplemented by additional analyses to
further examine the unidimensionality of the data.



Factor Analysis

Method. The factor analytic approach was used with two of the tests
available: the first midquarter administered in winter (W1) and the second
midquarter administered in spring (S2). The first step of the analysis was
to compute a 55x55 matrix of inter-item correlations. The tetrachoric routine
in the Statistical Package for the Social Sciences (SPSS; Nie, Hull, Jenkins,
Steinbrenner, & Bent, 1970) was used. Since students were instructed to
answer only 50 of the 55 questions, there was considerable non-systematic
missing data. The program was instructed to compute a correlation between
any two test items, excluding cases for which the responses to one or both
items were missing (i.e., "pairwise deletion"). Since items were probably
omitted on a non-random basis, an unknown amount of bias may have been
introduced as a result of this procedure.
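Pairwise deletion can be illustrated with a short sketch. Note the substitution: the study computed tetrachoric correlations in SPSS, whereas the code below uses the simpler phi (Pearson) coefficient on 0/1 scores purely to show how cases are dropped pair by pair. The function name `pairwise_phi` and the response vectors are hypothetical.

```python
import math

def pairwise_phi(x, y):
    """Phi (Pearson) correlation of two 0/1 item vectors with pairwise deletion.

    None marks an omitted item. A testee is dropped for this pair of items
    only if either response is missing. Phi is a stand-in here; the report
    itself used tetrachoric correlations.
    """
    pairs = [(u, v) for u, v in zip(x, y) if u is not None and v is not None]
    n = len(pairs)
    mx = sum(u for u, _ in pairs) / n
    my = sum(v for _, v in pairs) / n
    sxy = sum((u - mx) * (v - my) for u, v in pairs)
    sxx = sum((u - mx) ** 2 for u, _ in pairs)
    syy = sum((v - my) ** 2 for _, v in pairs)
    return sxy / math.sqrt(sxx * syy), n

# Eight hypothetical testees; each omitted one of the two items at most.
item_1 = [1, 0, 1, 1, None, 0, 1, 0]
item_2 = [1, 0, None, 1, 1, 0, 0, 0]
r, n_used = pairwise_phi(item_1, item_2)
```

Because each correlation in the matrix may rest on a different subset of testees, the resulting matrix need not be internally consistent, which is one source of the bias the text mentions.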



The resulting correlation matrices were factor analyzed by the principal
axis method. The initial communality estimate for each item was chosen
to be the largest off-diagonal correlation. These estimates were then
iterated (with a limit of 25 iterations) until the difference between
communality estimates on two successive iterations was negligible. The
correlation matrices for the two tests with iterated communalities are shown
in Appendix Table B.

Following the procedures suggested by Horn (1965) and used by McBride and
Weiss (1974) and Prestwood and Weiss (1977) to determine the number of
factors in the real data matrix, a matrix of random data for 55 variables and
1,000 hypothetical testees was generated. These random data were
intercorrelated and factor analyzed employing the same procedures as for the
two real data matrices. The eigenvalues from the random data were compared
with those of the real data in order to determine the number of factors in
the real data.
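The comparison of real and random eigenvalues can be sketched as a simple counting rule. This is one possible decision rule, not necessarily the one the authors applied, since the report also weighs the pattern of loadings. The function name `count_factors` is hypothetical; the eigenvalues are the first four factor contributions reported in this section for the W1 data and the random data.

```python
def count_factors(real_eigs, random_eigs):
    """Count leading factors whose eigenvalue exceeds the random-data eigenvalue
    at the same position, stopping at the first factor that does not."""
    count = 0
    for real, rand in zip(real_eigs, random_eigs):
        if real <= rand:
            break
        count += 1
    return count

w1_eigs = [12.8, 1.96, 1.81, 1.75]       # W1 factor contributions, factors 1-4
random_eigs = [2.11, 1.98, 1.90, 1.83]   # random-data factor contributions, factors 1-4
n_factors_w1 = count_factors(w1_eigs, random_eigs)
```

Under this strict rule only the first W1 factor stands above its random-data counterpart.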

Predictions about the factor structure to be obtained if the data are
unidimensional can be made in a manner parallel to that used by McBride and
Weiss (1974). In this instance, the predictions to be made are as follows:

1. The first factor extracted from each of the real data sets should
   be a general unipolar factor; the random data set should not exhibit
   this factor.

2. All factors other than the first factor from each of the real data
   sets should be of approximately equal magnitude and should be
   bipolar (that is, they should have as many negative loadings as
   positive loadings).

3. All factors extracted from the real data, except for the first factors,
   should be indistinguishable from the factors extracted from the
   random data.




Results. Figure 1 shows the factor contribution (eigenvalue) plots
for the two sets of real data and the random data. From this figure it
can be seen that both real data sets included a relatively strong first
factor and that all of the remaining factors had low factor contributions
restricted to a narrow range. It is also clear that the random data set
lacked the strong first factor evident in the real data. Finally, all of the
factors extracted from the real data, with the exception of the first factor,
had factor contributions that were very similar in magnitude to the factor
contributions of the factors extracted from the random data. The factor
contribution data show that in the W1 data there was clearly one factor; in
the S2 data there was a very strong first factor and a suggestion of two or
three very weak secondary factors.

Figure 1
Eigenvalues for W1 Data, S2 Data and Comparable Random Data

[Plot of factor contribution (eigenvalue) against factor number, factors 1
through 10, with separate lines for the W1, S2, and RANDOM data sets.]



The first factor extracted from the W1 data accounted for 23.3% of the
total variance in the 55 items with a factor contribution of 12.8; the first
factor from the S2 data accounted for 24.4% of the total variance with a
factor contribution of 13.4. No other factor extracted from either the real
data or the random data accounted for more than 4.5% of the total variance
of the test items.
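The percentages quoted above follow from dividing each factor contribution by the number of items, since each standardized item contributes one unit of variance. A short check of that arithmetic, with a hypothetical helper name:

```python
N_ITEMS = 55  # items in each midquarter test

def percent_of_total_variance(factor_contribution, n_items=N_ITEMS):
    """Share of total variance accounted for by one factor, in percent."""
    return 100.0 * factor_contribution / n_items

w1_first = percent_of_total_variance(12.8)   # first W1 factor
s2_first = percent_of_total_variance(13.4)   # first S2 factor
```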



Table 9 reports the factor loadings from each of the three data sets
for the first four factors extracted from each matrix. The first factor
obtained from each of the two real data sets had a large number of loadings
which were higher than those in the random data; all these high loadings
were unipolar. The first factor obtained from the random data was weak
and bipolar. The second, third, and fourth factors obtained from all data
sets were weak bipolar factors. Although the second factor from W1 had
a factor contribution (1.96) indistinguishable from the corresponding factor
(1.98) of the random data, it had two loadings which were higher in absolute
value than those of the random data. Factor 2 from S2, which had a factor
contribution (2.49) slightly higher than that of the random data (1.98),
had three loadings greater than the highest in the random data. For
factors 3 and 4 the factor contributions for the W1 data (1.81 and 1.75,
respectively) were lower than those of the random data (1.90 and 1.83);
for the S2 data the corresponding factor contributions were higher (2.24
and 2.22). None of the loadings of the W1 factors 3 and 4 exceeded the
highest loading in the random data, while two of the S2 loadings on factor 3
and one loading on factor 4 exceeded the corresponding random data loadings
in absolute value.

These results suggest that factors 2, 3, and 4 from S2 and W1 are similar
to factors of random data and, in all probability, represent trivial factors.
In general, then, these results tend to support the existence of a single
major factor in these achievement test data.

Equality of ICC's Based on Content Areas and Total Test

Rationale. In addition to implying that there is one factor in the item
responses, the assumption of unidimensionality implies that ICC's will be
linearly related across samples of items from the same domain of content.
One way to examine this assumption is to compare the ICC's based on the total
set of 55 items within a given midquarter with the ICC's computed within
the content areas comprising that midquarter. If the total test measures a
single dimension, parameterization of items within content areas should
result in ICC parameters which are highly correlated with those obtained
across all content areas. If this result is not found, it can be concluded
that the content area is measuring a dimension which is not predominant in
the total set of items and that the test items are not unidimensional.

A more stringent criterion for unidimensionality is that the item
parameter estimates for items parameterized within a content area should be
numerically the same as the parameter estimates obtained for those same
items when all the content areas are calibrated together. This is equivalent
to saying that the metric defined by items in a given content area is
interchangeable with the metric defined by all the items. This criterion of
unidimensionality implies that 1) the regression of the two sets of parameter
estimates should be linear; 2) the slope of the regression line should be
1.0 within sampling error; and 3) the intercept of the regression line
should be 0.0.
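The slope and intercept criteria can be sketched with an ordinary least-squares fit and the "within twice the standard error" rule that the report applies in its tables. The helper names `fit_line` and `as_predicted` and the seven pairs of b estimates are hypothetical; the standard-error formulas are the usual ones for simple linear regression.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, se_slope, se_intercept).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)                                   # residual variance
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return slope, intercept, se_slope, se_intercept

def as_predicted(estimate, predicted, se):
    """True if the estimate lies within twice its standard error of the prediction."""
    return abs(estimate - predicted) <= 2.0 * se

# Hypothetical b estimates: total-test calibration (x) and content-area calibration (y).
total_b = [-1.5, -0.8, -0.2, 0.3, 0.9, 1.6, 2.2]
area_b = [-1.4, -0.9, -0.1, 0.2, 1.0, 1.5, 2.3]
slope, intercept, se_slope, se_intercept = fit_line(total_b, area_b)
slope_ok = as_predicted(slope, 1.0, se_slope)
intercept_ok = as_predicted(intercept, 0.0, se_intercept)
```

With only a handful of items per content area the standard errors are large, so this rule is lenient; the linearity tests reported alongside it address a different aspect of the same criterion.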

Method. Using Urry's ESTEM item calibration program, ICC item parameter
estimates were computed within each content area for each of the four
midquarter examinations. Item parameter estimates within content areas (shown
in Appendix Table C) were then correlated with those determined earlier
using all the items in each examination. Item parameter estimates for content
area ICC's and total test ICC's were correlated for the a and b parameters
separately and within each examination. The significance of linear and
polynomial trends was also tested in these data using program BMD02V from
the Biomedical Computer Program Package (Dixon, 1975). In addition, the
slope and intercept of the regression lines were determined and tested for
statistical significance. Because the c parameter was poorly estimated by
Urry's program with the numbers of testees and items available in this study,
these analyses were confined to the a and b parameters.

                                    Table 9
            Unrotated Factor Loadings for the First Four Factors of
              W1 Data, S2 Data and Comparable Random (Ran) Data

           Factor 1            Factor 2            Factor 3            Factor 4
Item      W1    S2   Ran      W1    S2   Ran      W1    S2   Ran      W1    S2   Ran
   1     .27   .46  -.06     .13   .10   .07    -.09   .09   .05    -.09  -.04  -.06
   2     .43   .43   .39     .12   .05   .02     .13   .07   .10     .02   .06   .00
   3     .48   .37  -.28    -.40  -.08  -.09     .05   .04   .28    -.03  -.06  -.10
   4     .50   .48  -.02    -.01   .05   .16    -.17  -.07   .17    -.03  -.10   .00
   5     .43   .53          -.36   .12   .20     .09   .08  -.1?    -.17  -.23  -.14
   6     .26   .59  -.11    -.15   .08   .08     .16  -.12   .13     .04  -.11  -.12
   7     .58   .06   .00    -.02  -.09   .11     .11  -.12  -.01     .00  -.08  -.20
   8     .58   .53  -.09     .08   .13  -.05    -.12  -.06   .01    -.03   .14  -.14
   9     .51   .55   .06    -.07   .09   .07    -.18  -.12   .26     .12  -.42  -.03
  10     .63   .61           .02   .08   .04    -.23   .03   .19    -.11  -.70  -.01
  11     .55  -.04   .08     .02  -.13   .03     .00  -.37  -.03     .07  -.25  -.05
  12     .55   .50   .00     .05   .23   .00     .05  -.04   .08     .16  -.14   .00
  13     .54   .53   .12    -.02   .27   .20     .17   .09   .16    -.23   .07   .00
  14     .48   .17   .12    -.45   .18   .06    -.31  -.19   .12     .03   .10   .10
  15     .22   .45   .13     .14   .17  -.12    -.02  -.04   .06     .04  -.08  -.02
  16     .28   .47  -.16    -.01   .25   .05    -.08   .09   .17     .03   .11   .07
  17     .47   .55   .24     .09   .32  -.01    -.03  -.04  -.09     .09   .04   .06
  18     .66   .66   .06     .10   .27  -.18     .07   .11  -.03     .05  -.02  -.06
  19     .58   .59  -.02     .08   .25  -.27    -.09  -.12  -.09    -.11   .03   .08
  20     .28   .50  -.03     .10   .21   .00     .19   .04   .09     .16   .10   .17
  21     .33   .51  -.15    -.03   .3?   .09    -.13  -.21  -.02     .17   .07   .02
  22     .41   .46   .04     .17   .27   .14    -.19  -.03  -.02     .10   .12  -.10
  23     .41   .50   .06     .22  -.0?   .25    -.01  -.01  -.01    -.16  -.18  -.18
  24     .37  -.49  -.06     .12  -.14   .01     .06   .03   .05    -.08   .07   .06
  25     .38   .40  -.03    -.13   .00   .17     .07   .00   .11    -.13   .24  -.10
  26     .54   .49  -.13    -.26  -.04   .08    -.17  -.08   .03     .29   .27   .02
  27     .59   .15  -.30    -.14   .08   .14     .20   .11  -.08    -.20   .07   .26
  28     .59   .46   .00     .04   .14  -.13     .19  -.04   .24     .21  -.11  -.11
  29     .34   .35   .27     .15   .07  -.15     .22   .13  -.01     .08   .22   .34
  30     .49   .62  -.04     .02   .03  -.02     .10  -.08  -.06    -.23   .03   .02
  31     .50   .64   .02    -.08  -.07   .09    -.28  -.20   .09    -.15   .16   .14
  32     .65   .32   .21     .05  -.02  -.10     .03  -.07  -.02     .17   .1?   .06
  33     .38   .34  -.18     .13   .10  -.19    -.12   .14   .08    -.24   .25   .06
  34     .64   .64   .04    -.05  -.12   .16     .10   .14  -.03    -.15   .14   .01
  35     .44   .63   .15     .22  -.09          -.19   .13   .15     .10   .03   .21
  36     .34   .46   .15     .18  -.07   .1?    -.08   .10  -.04    -.28   .18   .17
  37     .66   .47  -.07    -.07  -.30  -.20     .08   .12   .06     .02   .24  -.02
  38     .46   .47   .07    -.09   .08  -.10     .11   .14  -.03     .03  -.09   .07
  39     .28   .19  -.09     .07  -.04  -.38    -.08   .01   .02     .02  -.09   .04
  40     .49   .65   .12    -.06  -.13   .20     .44  -.04  -.01    -.12   .06   .04
  41     .47   .55   .00    -.16  -.10  -.04     .02   .02   .19    -.05  -.10   .12
  42     .30   .49   .04     .07  -.08   .11     .12  -.22  -.06     .07  -.16   .02
  43     .49   .56  -.03    -.27  -.08   .08     .16   .08  -.04    -.17   .14  -.30
  44     .63   .56  -.06     .16  -.54  -.12    -.03  -.65   .42     .13  -.08   .27
  45     .57   .32  -.04     .07   .13   .12     .00   .19  -.26     .10   .22   .04
  46     .68   .37   .42     .13  -.05  -.08     .00  -.28  -.03     .04   .16  -.06
  47     .32   .36  -.07    -.03  -.08  -.0?     .06   .07   .03     .08   .03   .28
  48     .27   .38   .21    -.17  -.02  -.01    -.23  -.03  -.10     .25  -.14  -.18
  49     .27   .32   .13     .02  -.06  -.34     .10  -.31  -.08     .22  -.18  -.29
  50     .50   .53   .35     .11   .06  -.04    -.11  -.14   .15    -.20  -.16   .00
  51     .08   .55   .02     .12  -.46   .28     .04  -.48   .23    -.02  -.16   .21
  52     .40   .60  -.21     .20  -.36   .02    -.09  -.38  -.07    -.11   .00  -.05
  53     .42   .59  -.17     .27  -.52  -.14     .06  -.37  -.14     .06   .07   .11
  54     .52   .48  -.08    -.07  -.18   .03     .18  -.02  -.36     .10   .11  -.06
  55     .37   .47   .07    -.03  -.12   .13     .04  -.06   .03     .26   .17   .08

Factor Contribution
       12.84 13.44  2.11    1.96  2.49  1.98    1.81  2.24  1.90    1.75  2.22  1.83

Results. Fifty-one items were rejected using the criteria in Urry's
calibration program. Approximately half were excluded by the program in
both the total test calibration and the content area calibration. Only one
item was excluded in the content area calibration that was not excluded in
the total test calibration.

Table 10 shows the Pearson product-moment correlations of the a
parameter estimates for the content areas and the total test. It also shows
the significance levels of the first through fourth degree polynomials in the
prediction of the a parameter estimates for items in each content area by
the total test a parameters. Correlations varied from .18 to .95. These
linear trends were statistically significant (p<.05) in 7 of 10 instances.
As Table 10 and Appendix Table D show, non-linear quadratic trends were
significant in only two instances; none of the cubic and quartic trends
were statistically significant. In test S1 there was no significant
relationship between the two sets of parameters for content area 3; it was
the only content area which did not exhibit a significant trend in one of
the two quarters.



                                  Table 10
Product-Moment Correlations and Level of Significance for Polynomial
Trends in the Prediction of Content Area a Parameter Estimates From
Total Test a Parameter Estimates for Four Tests

        Content   No. of            Significance of Polynomial Trends
Test     Area     Items      r      Linear   Quadratic   Cubic   Quartic
W1         1        13      .69    p<.005      NS*        NS       NS
           2        18      .77      .001      NS         NS       NS
           3        10      .24      NS        .05        NS       NS
S1         1        12      .43      NS        .05        NS       NS
           2        14      .72      .005      NS         NS       NS
           3         9      .18      NS        NS         NS       NS
W2         4        31      .93      .001      NS         NS       NS
           5        11      .86      .001      NS         NS       NS
S2         4        30      .93      .001      NS         NS       NS
           5        12      .74      .01       NS         NS       NS

*NS indicates that the polynomial was not statistically significant
at the .05 level. Significance was determined by the use of an
F-statistic. The sums of squares used for calculating the F-value
are shown in Appendix Table D.

Table 11 shows the correlations and tests of polynomial trends for the
b parameter. These correlations ranged from .86 to .99; all but two were
.94 or above. Table 11 and Appendix Table E show that the linear trends for
all 10 instances were significant at the p<.001 level. None of the non-linear
trends were statistically significant.

                                  Table 11
Product-Moment Correlations and Level of Significance for Polynomial
Trends in the Prediction of Content Area b Parameter Estimates From
Total Test b Parameter Estimates for Four Tests

        Content   No. of            Significance of Polynomial Trends
Test     Area     Items      r      Linear   Quadratic   Cubic   Quartic
W1         1        13      .99      .001      NS*        NS       NS
           2        17      .94      .001      NS         NS       NS
           3        10      .95      .001      NS         NS       NS
S1         1        12      .98      .001      NS         NS       NS
           2        14      .99      .001      NS         NS       NS
           3         9      .91      .001      NS         NS       NS
W2         4        31      .97      .001      NS         NS       NS
           5        11      .98      .001      NS         NS       NS
S2         4        30      .99      .001      NS         NS       NS
           5        12      .86      .001      NS         NS       NS

*NS indicates that the polynomial was not statistically significant
at the .05 level. Significance was determined by the use of an
F-statistic. The sums of squares used for calculating the F-value
are shown in Appendix Table E.

The data in Tables 10 and 11 show that the relationship between the
ICC item parameters computed within content areas and those computed when
the items were embedded within the total test was linear for the b
parameter and primarily linear for the a parameter. The data from the spring
quarter tests tended not to fit the predictions as well as those from the
winter quarter tests, since there was no significant relationship in the
a parameter data for content area 3 of S1. This is the same content area
which also had one of the lowest correlations in the b parameter data.

Strong inferences concerning the unidimensionality assumption can be
drawn from an examination of the slope and intercept of the regressions of
the content area and total test ICC parameters. These data are shown in
Table 12. The results for the slope of the a (discrimination) parameter were
in accordance with the prediction of a slope of 1.0 in only one instance.
The intercept of the a parameter exceeded twice its standard error in only
three of the ten instances.

For the b parameter, Table 12 shows that the slope of the regression
line deviated significantly from its predicted value in content area 3 for
W1 and S1 and content area 1 for W1; the remainder of the slopes did not
differ from 1.0 by more than twice their standard errors. The intercepts
for the b parameter deviated significantly from zero for content areas 2
and 3 in W1 and content area 2 in S1. There were no deviations from the
predicted values for either slope or intercept of the b parameters for the
second examination (W2 or S2).

                                  Table 12
Slopes and Intercepts and Their Standard Errors (S.E.) for the
Bivariate Regression of Content Area Item Parameters and Total
Test Item Parameters

Test and                        Slope                      Intercept
Content    No. of
Area       Items      Slope    S.E.   Pred.(1)     Int.    S.E.   Pred.(2)

a (discrimination) Parameter
W1
  1          13        .54     .17       N          .43     .30      Y
  2          15        .56     .11       N          .14     .19      Y
  3           8        .20     .22       N          .67     .40      Y
S1
  1          12        .13     .09       N          .83     .15      N
  2                    .77     .21       Y         -.16     .36      Y
  3           7        .15     .23       N          .76     .47      Y
W2
  4          29        .82     .07       N          .12     .09      Y
  5          19        .51     .10       N          .31     .17      Y
S2
  4          30        .37     .15       N          .63              N
  5          12        .22     .06       N          .66     .10      N

b (difficulty) Parameter
W1
  1          13        .94     .03       N          .00     .03      Y
  2          15       1.08     .06       Y         -.41     .09      N
  3           8        .73     .08       N          .46     .13      N
S1
  1          12       1.03     .07       Y         -.16     .08      Y
  2          14        .93     .04       Y         -.31     .06      N
  3           7        .72     .12       N          .11     .20      Y
W2
  4          29        .97     .05       Y         -.07     .06      Y
  5          19        .97     .06       Y          .01     .07      Y
S2
  4          30       1.05     .07       Y          .06     .07      Y
  5          12        .77     .14       Y         -.21     .13      Y

(1) Y indicates that the value of the slope was as predicted, i.e., did not
differ from the predicted value of 1.0 by more than twice its standard
error; N otherwise.
(2) Y indicates that the value of the intercept was as predicted, i.e., did
not differ from the predicted value of 0.0 by more than twice its standard
error; N otherwise.

Conclusions

The factor analysis strongly supported the belief that only one real
factor was present in each of the two tests analyzed. Every other factor
fell at or near the level of the factors extracted by the same methods from
random data and had loadings which were largely similar to those in the
random data.

The analysis of the ICC parameters estimated in the context of the total
test and individual content areas also lent credence to the hypothesis of
unidimensionality. Although there were some deviations from predicted
relationships, content area estimates were primarily linearly related to
total test parameter estimates. The regression slopes and intercepts
tended to follow the predicted patterns, particularly for the b parameter.
For the a parameter the slope of the regression did not generally follow
the predicted pattern, but the results were generally in accord with the
predictions for the intercept of the regressions.



Thus, even though there were some deviations from strict
unidimensionality, the two types of evidence indicate that the assumption of
essential unidimensionality is valid.

Sampling Invariance of Item Parameter Estimates

According to Lord and Novick (1968, p. 380), ICC item parameter estimates
determined in two subgroups are invariant if:

1. the regression of the b parameter estimates for two population
   subgroups is linear with a slope equal to σ1(θ)/σ2(θ), where σ1(θ) and
   σ2(θ) are the standard deviations of θ in the two population subgroups,
   and the intercept is equal to the difference in the mean ability
   level between the two groups;

2. the regression for the a parameter estimates is also linear and has
   a zero intercept, and the slope is equal to the ratio of the two
   subgroup standard deviations of θ.

Similar predictions could be made for the c parameter. However, similar to
the previous analyses, these analyses of sampling invariance were confined to
the a and b parameters and were not applied to the c parameter.

Method

In the two quarters used for item calibration, 46 items were administered
to two different groups of students. Since these items were administered to
different groups in the context of different tests, a comparison of the
parameters obtained from the two calibrations of these items will serve as a
strong test of the invariance of the item parameters. If invariance is
observed, it can be interpreted as additional evidence for the applicability
of ICC theory in an achievement measurement setting.

Of the 46 items which had been administered to two groups of students,
25 items were used in the sampling invariance analysis. Items were included
in the analysis if they had been administered at the same point in the course
during both quarters (e.g., items administered at W1 and S1 or at WF and SF
were used, whereas an item administered at W1 and SF was not used).

For each item administered, item parameter estimates were obtained in each
of the samples within the context of the calibration of the total set of
items. Parameter estimates obtained from the second administration were
regressed on those obtained from the first administration; these regressions
were tested for polynomial trends. In addition, the slopes and intercepts
of the regression equations were compared with predicted values.

                                  Table 13
                    Parameter Estimates for Items Used
                     in Study of Sampling Invariance

            First Administration         Second Administration
Item                  Parameter                     Parameter
Number      Test      a        b          Test      a        b
3002         WF      .82      .13          SF      .87      .12
3034         W1     1.01      .37          S1      .85     -.29
3038         W1     1.58     -.56          S1     1.20    -1.06
3201         W1     1.07    -1.34          S1      .85    -1.74
3206         W1      .74     1.51          S1      .75     1.57
3216         W1     1.27     -.62          S1     1.17     -.60
3218         W1      .82      .58          S1      .80      .34
3229         W1                            S1
3237         WF     1.54     -.37          SF     1.58     -.11
3241         W1     1.12     2.48          S1      .91     2.09
3243         W1                            S1
3414         W1      .88     2.29          S1     1.40     1.96
3612         WF                            SF     1.12      .75
3651         W2      .81     2.27          S2      .95     2.31
3812         W2      .74     -.66          S2      .82     -.63
3909         W2     1.34      .77          S2      .90     1.12
4005         WF                            SF     1.23     2.76
4006         WF      .84     -.59          SF     1.05     -.19
4025         WF                            SF
4026         WF                            SF
4036         WF     1.24                   SF      .95    -1.30
4044         WF      .80     -.12          SF      .80     -.60
4203         WF                            SF
4229         WF     1.36     -.45          SF     1.64     -.92
4238         WF      .83     1.54          SF      .83     1.47

Note. Blank item parameters indicate that the item
was rejected by the parameterization program.



Results

The items used in this phase of the analysis and their parameter
estimates are shown in Table 13; these items had a fairly representative
range of a and b values and included items from each content area. Of the
25 items available, seven were rejected by Urry's exclusionary criteria in
at least one of the two groups. Five of these items were rejected at both
calibrations.

Figure 2 shows a plot of the a parameter estimates obtained for the 18
items for which parameter estimates were available both quarters. Results of
the linearity test are in Table 14. As Figure 2 shows, the slope of the
linear regression line was .61 with a standard error of .19. The predicted
value of the slope of the linear regression was .97, based on the ratio of
the standard deviations of the total test θ estimates obtained in the winter
and spring quarter data. Thus, the slope did not deviate from its predicted
value by more than twice its standard error. The intercept of the regression
line was .38 with a standard error of .21; it, too, did not deviate from its
predicted value (0.0) by more than twice its standard error.



Figure 2
Plot of a Parameters of Items Calibrated Twice

[Bivariate plot of the a parameter estimates from the two calibrations,
with the fitted linear regression line.]

The data shown in Table 14 indicate that the regression of the two sets
of parameter estimates was linear. The Pearson product-moment r of .63 was
statistically significant at p<.005; none of the curvilinear trends was
statistically significant.

Table 14

Product-Moment Correlations and Level of Significance of the Contribution
of Each Term of a Fourth Degree Polynomial Expression to the Prediction of
the a and b Parameter Estimates Obtained During Spring Quarter Testing
from Those Obtained During Winter Quarter Testing

                          Significance of Polynomial
Parameter      r      Linear    Quadratic    Cubic    Quartic

    a         .63      .005        NS*         NS        NS
    b         .96      .001        NS          NS        NS

*NS signifies that a significance level of p=.05 was not attained.
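
The kind of trend analysis summarized in Table 14 can be illustrated with a short Python sketch. The report does not describe how the test was computed, so the sequential F ratio used here is an assumption about one common way to test each polynomial term. The 18 data points are made up for illustration and are not the item parameter estimates from this study; the function names are likewise hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def rss_poly(x, y, degree):
    """Residual sum of squares of a least-squares polynomial of a given degree."""
    mean_x = sum(x) / len(x)
    xc = [xi - mean_x for xi in x]  # centering improves numerical conditioning
    k = degree + 1
    # Normal equations (X'X) beta = X'y for the design matrix [1, x, x^2, ...].
    xtx = [[sum(v ** (i + j) for v in xc) for j in range(k)] for i in range(k)]
    xty = [sum(v ** i * yi for v, yi in zip(xc, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v ** j for j, bj in enumerate(beta)) for v in xc]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def incremental_f(x, y, max_degree=4):
    """F ratio for the added contribution of each successive polynomial term."""
    n = len(x)
    rss = [rss_poly(x, y, d) for d in range(max_degree + 1)]
    ratios = {}
    for d in range(1, max_degree + 1):
        df_resid = n - d - 1
        ratios[d] = (rss[d - 1] - rss[d]) / (rss[d] / df_resid)
    return ratios


# Made-up "winter" (x) and "spring" (y) estimates for 18 items:
# a linear relation plus a small alternating disturbance.
x_demo = [0.6 + 0.05 * i for i in range(18)]
y_demo = [0.4 + 0.6 * xi + 0.05 * (-1) ** i for i, xi in enumerate(x_demo)]
f_ratios = incremental_f(x_demo, y_demo)
for degree, f in f_ratios.items():
    print(f"degree {degree}: F = {f:.2f}")
```

Each ratio would be compared with the critical value of F on 1 and n - d - 1 degrees of freedom. On this artificial data only the linear term produces a large ratio, which is the pattern Table 14 reports for both parameters.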



Figure 3 shows the bivariate plot of the b parameter estimates for the
data from the two quarters. The linear regression line fitted to these points
had a slope of 1.02 with a standard error of .07. Thus, it did not differ
from its predicted value of .97 by more than twice its standard error. The
mean difference in θ estimates obtained from the winter and spring groups
was -.09. The intercept of the regression in Figure 3 was -.18 with a standard
error of .08. Thus, the observed intercept for the b parameters did not differ
from its predicted value, the mean θ difference of -.09, by more than twice
its standard error.

Figure 3

Plot of b Parameters of Items Calibrated Twice

As shown in Table 14, the linear correlation between the two sets of
parameter estimates was .96, which was highly significant; none of the
non-linear trends was statistically significant.
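
The two-standard-error criterion applied to the regression slopes and intercepts can be checked directly from the reported values. The following Python sketch is illustrative only. The function name is hypothetical, and treating the mean θ difference of -.09 as the predicted intercept for the b parameters is an assumption drawn from the order in which the values are reported.

```python
def within_two_se(observed, predicted, se):
    """True if observed is no more than two standard errors from predicted."""
    return abs(observed - predicted) <= 2 * se


# (observed, predicted, standard error) as reported in the text.
# The predicted b intercept of -0.09 is an assumption (mean theta difference).
checks = {
    "a slope": (0.61, 0.97, 0.19),
    "a intercept": (0.38, 0.00, 0.21),
    "b slope": (1.02, 0.97, 0.07),
    "b intercept": (-0.18, -0.09, 0.08),
}

results = {name: within_two_se(*vals) for name, vals in checks.items()}
for name, (obs, pred, se) in checks.items():
    distance = (obs - pred) / se  # deviation in standard-error units
    print(f"{name}: {distance:+.2f} SE, within criterion: {results[name]}")
```

All four comparisons satisfy the criterion. The a slope and intercept lie about 1.9 and 1.8 standard errors from their predicted values, so they pass only narrowly, whereas the b slope and intercept lie closer to prediction.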



Conclusions 



These results strongly support the invariance characteristics of the
a and b ICC parameters across subgroups from the same population. Results
for both parameters showed linear relationships between the parameter
estimates derived in two samples of persons, when the items were in the
context of different subsets of items in each sample. In addition, the
results from the linear regression met the strong criteria of sampling
invariance predicted by the ICC model. These results strongly support the
application of the ICC a and b parameter estimates in an achievement testing
context.
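
The reason invariance implies a linear relationship between two calibrations can be sketched in Python. This section does not state which ICC form was used, so the three-parameter logistic function below (with scaling constant D = 1.7) is an assumption, as are the function names, the item values, and the scale constants k and m. The sketch shows only that a linear change in the θ scale is absorbed by a linear change in a and b, leaving the response probabilities unchanged.

```python
import math


def icc(theta, a, b, c, D=1.7):
    """Three-parameter logistic ICC: probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))


def rescale_item(a, b, c, k, m):
    """Item parameters after the ability scale is transformed to k*theta + m."""
    return a / k, k * b + m, c


# Hypothetical scale change and hypothetical items (a, b, c).
k, m = 1.25, 0.4
items = [(1.2, -0.5, 0.20), (0.8, 1.0, 0.25)]
thetas = [-2.0, 0.0, 1.5]

# Largest change in response probability across all items and ability levels.
max_gap = max(
    abs(icc(t, a, b, c) - icc(k * t + m, *rescale_item(a, b, c, k, m)))
    for (a, b, c) in items
    for t in thetas
)
print(f"largest probability difference: {max_gap:.2e}")
```

Because the probabilities agree up to rounding error, estimates of the same items obtained on two differently scaled samples should fall on a straight line, which is the property tested by the regressions reported above.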



Conclusions

Answers can now be given to the questions which guided this research: 

1. Do achievement test item pools permit calibration by ICC models and
result in an item pool suitable for adaptive testing?

Of the 394 unique items, 309 survived ICC calibration procedures to
form a total pool of wide-ranging difficulty with moderate to high
discriminations. Except for the high values of the c parameter,
this pool met and exceeded reasonable standards set for an item pool
for use in adaptive testing. The two midquarter examination subpools
also were suitable for adaptive testing. The two pools contained
114 and 123 items with mean a values of 1.24 and 1.19, respectively.
Difficulty (b) parameter values were relatively rectangularly
distributed in the range of -1.75 to about +1.75; items were also
available with b values as high as 3.21. However, there was a lack
of items in the very low difficulty range.

2. Are responses to achievement test items reasonably unidimensional?

Both the factor analytic study and the study of item parameter
estimates for content areas and the total test support the
unidimensionality assumption. There was some indication that deviations
from unidimensionality existed in the data, but they appeared to be
minor compared to the major factor in the data.

3. Do item parameter estimates remain invariant across samples?

Both the a and b parameters were consistently estimated across two
samples. Both met strong criteria of invariance in terms of linearity
of the estimates and predicted values of the regression slopes and
intercepts. These results are particularly meaningful considering
that the items studied appeared in the two tests in the context of
other items which were not generally the same in both groups of
students.

The primary results of these studies indicate that ICC theory can be
applied to a classroom achievement test item pool. This is an extension of
the application of ICC theory, which has been primarily limited to ability
testing until now. If these results replicate in other areas of the
achievement testing domain, it will be possible to link ICC theory with
computerized adaptive test administration. This combination will yield a
more thorough and efficient system for measuring achievement and for
evaluating the effectiveness of training programs.
